blobstore: support parallel download option and createParentPath option in AWS and GCP#377
blobstore: support parallel download option and createParentPath option in AWS and GCP#377LihaoLiuXs wants to merge 17 commits intosalesforce:mainfrom
Conversation
Codecov Report❌ Patch coverage is ❌ Your patch status has failed because the patch coverage (37.66%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## main #377 +/- ##
============================================
- Coverage 82.10% 81.49% -0.62%
- Complexity 641 643 +2
============================================
Files 194 194
Lines 11798 11994 +196
Branches 1572 1605 +33
============================================
+ Hits 9687 9774 +87
- Misses 1434 1530 +96
- Partials 677 690 +13
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
This looks solid. Could you share the testing logs or a brief summary of how this was verified on |
| import software.amazon.awssdk.http.apache.ApacheHttpClient; | ||
| import software.amazon.awssdk.http.apache.ProxyConfiguration; | ||
| import software.amazon.awssdk.regions.Region; | ||
| import software.amazon.awssdk.services.s3.S3AsyncClient; |
There was a problem hiding this comment.
that's a bit odd - we shouldn't mix sync and async in here. AwsBlobStore in meant for sync operations
There was a problem hiding this comment.
move the parallel download option to asyncClient.
| import software.amazon.awssdk.transfer.s3.S3TransferManager; | ||
| import software.amazon.awssdk.transfer.s3.model.DownloadFileRequest; |
sandeepvinayak
left a comment
There was a problem hiding this comment.
if the client wants to do parallel downloads - they need to use async client. We shouldn't mix these two - it will get into the high maintainance sphagetti code.
Agreed. I moved the parallel download option to asyncClient. |
| * When createParentPath is enabled, resolves the final file path by appending | ||
| * the object key to the destination root and creating any missing parent directories. | ||
| */ | ||
| private Path resolveDownloadDestinationPath(DownloadRequest request, Path destination) { |
There was a problem hiding this comment.
it;s not resolving, it's creating.
There was a problem hiding this comment.
Updated the name.
| Path destinationPath = resolveDownloadDestinationPath(request, path); | ||
| GetObjectRequest getObjectRequest = transformer.toRequest(request); | ||
| if (request.isParallelDownload()) { | ||
| DownloadFileRequest downloadFileRequest = |
There was a problem hiding this comment.
we should move it to transformer
There was a problem hiding this comment.
Moved this to transformer.
| } | ||
|
|
||
| /** | ||
| * (Optional) If true, provider implementations may preserve the key's parent path structure in |
There was a problem hiding this comment.
i think it creates paths and we should explicitly mention it
There was a problem hiding this comment.
Update the comment for this
|
|
||
| /** | ||
| * Worker count for {@link TransferManager} divide-and-conquer downloads, analogous to tuning | ||
| * gsutil sliced / parallel Range GET downloads for large objects (see Google Cloud Performance |
There was a problem hiding this comment.
let's remove any unlrelated details,we have nothing to do gsutil
There was a problem hiding this comment.
Remove the unrelated comments
| * gsutil sliced / parallel Range GET downloads for large objects (see Google Cloud Performance | ||
| * Atlas: Optimizing your Cloud Storage performance). | ||
| */ | ||
| private static final int PARALLEL_LARGE_FILE_DOWNLOAD_MAX_WORKERS = 8; |
There was a problem hiding this comment.
how do we chose 8 ?
There was a problem hiding this comment.
8 workers will give us 8 workers × 16 MiB chunks ≈ 800 MB/s aggregate ≈ 6.4 Gbps, which saturates the GCE Compute Engine default per-VM egress cap (7 Gbps).
-
For machine series that support multiple physical NICs, such as A3, A4, A4X, and A4X Max instances, 7 Gbps per NIC: https://docs.cloud.google.com/compute/docs/network-bandwidth#egress_outside_vpc
-
Number of work options: 4, 8, 16, 22, 32, 48. transfer_manager sliced download is slow on cloud run googleapis/python-storage#1093
| */ | ||
| private static final int PARALLEL_LARGE_FILE_DOWNLOAD_MAX_WORKERS = 8; | ||
|
|
||
| private static final String PARALLEL_STRIP_PREFIX_TO_BASENAME = "^.*/"; |
There was a problem hiding this comment.
the var name and value doesn't align
There was a problem hiding this comment.
This value matches everything up to and including the last '/' in an object key. For example for object key data/2026/04/report.csv, "^.*/" will match the data/2026/04/, and as result we will only keep the report.csv. I will rename this var to OBJECT_KEY_DIRECTORY_PREFIX_REGEX.
| // parallelDownload not supported: delegates to OutputStream which | ||
| // cannot accept parallel range-GET writes. |
There was a problem hiding this comment.
lets remove this comment, the docs are added in abstract
| private Path resolveDownloadDestinationPath(DownloadRequest request, Path destination) { | ||
| if (!request.isCreateParentPath()) { | ||
| return destination; | ||
| } | ||
| // Keep key parent structure under the provided local destination root when requested. | ||
| Path resolved = destination.resolve(request.getKey()).normalize(); | ||
| Path parent = resolved.getParent(); | ||
| if (parent != null) { | ||
| try { | ||
| Files.createDirectories(parent); | ||
| } catch (IOException e) { | ||
| throw new SubstrateSdkException("Failed to create destination directories", e); | ||
| } | ||
| } | ||
| return resolved; | ||
| } |
There was a problem hiding this comment.
since this is already common, should go to abstract class
There was a problem hiding this comment.
Moved this to AbstractBlobStore and AbstractAsyncBlobStore.
Summary
Moved the parallelDownd option to asyncClient. Since we don't have the conformance test for async client. I tested it locally with example.
GCP test result:

BUCKET_NAME=palsfdc mvn -q exec:java -Dexec.mainClass=com.salesforce.multicloudj.blob.Main -Dbucket.name=palsfdc 2>&1
=== STARTING PARALLEL DOWNLOAD TESTS ===
Provider: gcp
=== Testing Parallel Download (Path) ===
SLF4J(W): No SLF4J providers were found.
SLF4J(W): Defaulting to no-operation (NOP) logger implementation
SLF4J(W): See https://www.slf4j.org/codes.html#noProviders for further details.
SLF4J(W): Class path contains SLF4J bindings targeting slf4j-api versions 1.7.x or earlier.
SLF4J(W): Ignoring binding found at [jar:file:/Users/barry.liu/.m2/repository/org/slf4j/slf4j-simple/1.7.2/slf4j-simple-1.7.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J(W): See https://www.slf4j.org/codes.html#ignoredBindings for an explanation.
[parallelDownloadToPath] Uploading test blob: key=parallel-download-path-1776870394493.bin, size=16777216
[parallelDownloadToPath] key=parallel-download-path-1776870394493.bin, bytes=16777216, elapsedMs=1721, etag=CNLan4begZQDEAE=
[parallelDownloadToPath] SUCCESS
=== Testing Parallel Download (createParentPath) ===
[parallelDownloadWithCreateParentPath] Uploading nested-key blob: key=parallel/nested/dir/blob-1776870398374.bin
[parallelDownloadWithCreateParentPath] resolvedPath=/tmp/parallel-download-root-1776870398374/parallel/nested/dir/blob-1776870398374.bin, bytes=8388608, etag=COLJzofegZQDEAE=
[parallelDownloadWithCreateParentPath] SUCCESS
=== Testing Parallel Download (with Range) ===
[parallelDownloadWithRange] key=parallel-download-range-1776870400563.bin, bytes=3073, expected=3073
[parallelDownloadWithRange] SUCCESS
=== Testing Parallel vs Sequential Download ===
[parallelVsSequentialDownload] sequentialMs=2349, parallelMs=2422, sizeBytes=33554432, seqDownloadedBytes=33554432, parDownloadedBytes=33554432
[parallelVsSequentialDownload] SUCCESS
Parallel download tests completed!
AWS test result:
AWS_ACCESS_KEY_ID=ASIAZQ3DQ6RHW2RLTYBV AWS_SECRET_ACCESS_KEY=ptpSaDk3289DvAAa1a2nWydKWorUHOCv6C5EZirZ AWS_SESSION_TOKEN=IQoJb3JpZ2luX2VjEIj//////////wEaCXVzLWVhc3QtMSJHMEUCIQDkYobo2adHLDVDa+nGaRy4fMqPI+Kn5JXDOQuNqtJbCwIgGG0YDYe4mssxmxR0SXi05yynyLm9iMOoqCIgmBgBYz8qoQMIURAAGgw2NTQ2NTQzNzA4OTUiDDehr5JzqAAf6rhQNir+AhcU5dA7Rvp02CwPTI5f2u6StBC4zzKXwhmfHSFvhDr94unO6627FuL2s0TZTqoTWzCiMPeUlDbQsAECdnbMx+2WUyNNcMx3arePqqaH2WBTY8CnSTLEYIZ/JKUrvs9oSCC8gw79hW37tlQVN6fJWlHu4LnXQcECnt6NUCGH0TC4P3udxZs52GZ7cNxyxhhT8XULEuaz2iGPmDbVNW/aLT8ACAxZqnuYS0P0s/Fu7RWdOI2JW2Yjavt6+rq/ndDMzZpBKXTfGxCV0i9XPIpLhRA2VXAI9gZ5vqNSXVPpwhSFRViATPvppkHFOZ0j2K0ZnqOzo0ZgwTzcbPemo1e1retBNX0p5o7q6CIpwdcF549NmVvAGmvrPRQuYWEJUnMKRvTkNgPR0B6qCT4Nnr0zm4JJwXYpgPtns9gzfIngj/mjFxIK2reP92G/YKh/DDGeFt21sK93SyZmDk50qu+58TKwrNcMG88O9+hkltvhySc+o8b8Gu5cpbyCTZiggLIwy92jzwY63gFbDq00OpQM0b4/EsIVRKQLF6ONqU+QYqaRX1I9n/4TgKLz1/vYVOwb34450QreE8PakIhIQlnfPv2JNlQh8SzLmOGSF8BHX1+3rh+pEU9rPTo3mjJXts18LMtnFwP/ExOJ/A9dgWdQzRgX6GN2akSgC5pdnpoKZaGMVm5Ft7fU96EGl3D1cUiwcmrGHBHCbGeeND54tMoA+Xx/XhO7uZjdwlHHjMKgJuZmmOp76Nd/IFdBZ7Gfw9kfXfZdinVS8ilxzOJFzhvM0/jBIEL1fDFxmUgXKSUEaa0pEAiXOqg= AWS_REGION=us-west-2 BUCKET_NAME=example-test-bucket-barry BUCKET_REGION=us-west-2 mvn -q exec:java -Dexec.mainClass=com.salesforce.multicloudj.blob.Main -Dprovider=aws -Dbucket.name=example-test-bucket-barry -Dbucket.region=us-west-2 2>&1 | grep -vE '^\s+at |Caused by:' | tail -60
=== STARTING PARALLEL DOWNLOAD TESTS ===
Provider: aws
=== Testing Parallel Download (Path) ===
SLF4J(W): No SLF4J providers were found.
SLF4J(W): Defaulting to no-operation (NOP) logger implementation
SLF4J(W): See https://www.slf4j.org/codes.html#noProviders for further details.
SLF4J(W): Class path contains SLF4J bindings targeting slf4j-api versions 1.7.x or earlier.
SLF4J(W): Ignoring binding found at [jar:file:/Users/barry.liu/.m2/repository/org/slf4j/slf4j-simple/1.7.2/slf4j-simple-1.7.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J(W): See https://www.slf4j.org/codes.html#ignoredBindings for an explanation.
[parallelDownloadToPath] Uploading test blob: key=parallel-download-path-1776873590056.bin, size=16777216
[parallelDownloadToPath] key=parallel-download-path-1776873590056.bin, bytes=16777216, elapsedMs=2067, etag="e06888832b83d4f381e885edebd1d782"
[parallelDownloadToPath] SUCCESS
[parallelDownloadToPath] remote blob retained for inspection: parallel-download-path-1776873590056.bin
=== Testing Parallel Download (createParentPath) ===
[parallelDownloadWithCreateParentPath] Uploading nested-key blob: key=parallel/nested/dir/blob-1776873595717.bin
[parallelDownloadWithCreateParentPath] resolvedPath=/tmp/parallel-download-root-1776873595717/parallel/nested/dir/blob-1776873595717.bin, bytes=8388608, etag="57b019a28c426df5727b3992701bd2be"
[parallelDownloadWithCreateParentPath] SUCCESS
[parallelDownloadWithCreateParentPath] remote blob retained for inspection: parallel/nested/dir/blob-1776873595717.bin
=== Testing Parallel Download (with Range) ===
[parallelDownloadWithRange] key=parallel-download-range-1776873598786.bin, bytes=3073, expected=3073
[parallelDownloadWithRange] SUCCESS
[parallelDownloadWithRange] remote blob retained for inspection: parallel-download-range-1776873598786.bin
=== Testing Parallel vs Sequential Download ===
[parallelVsSequentialDownload] sequentialMs=3321, parallelMs=3580, sizeBytes=33554432, seqDownloadedBytes=33554432, parDownloadedBytes=33554432
[parallelVsSequentialDownload] SUCCESS
[parallelVsSequentialDownload] remote blob retained for inspection: parallel-vs-sequential-1776873600509.bin
Parallel download tests completed!
Some conventions to follow
docstore:for document store module,blobstorefor Blob Store moduletest:perf: